The main reason to use a GPU is the computational power it offers. There are two major advantages of a GPU over a CPU:
- High computational throughput
- Extremely high memory bandwidth
GPUs are designed for compute-intensive, highly parallel computation.
Therefore, more transistors are devoted to data processing rather than to data caching and advanced control logic.
CPUs are designed to minimize latency, so the majority of their silicon area is dedicated to:
- Advanced control logic
- Large caches
GPUs, in contrast, are designed to maximize throughput, so the majority of their silicon area is dedicated to:
- A massive number of cores (ALUs)
GPUs are well suited to data-parallel computations: a single kernel (function) is applied to a large number of data elements, so there is less need for sophisticated control logic and large caches.
In order to utilize the GPU, the parts of the program that can be parallelized must be decomposed into a large number of threads that can run concurrently. In CUDA, these threads are defined by special functions called kernels (kernels are functions that are executed on the device (GPU)).
Executing a kernel is called launching a kernel. When we launch a kernel, it is executed as a set of threads. Each thread runs on a CUDA core; the hardware schedules and executes threads in groups of 32, called warps.
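To make this concrete, here is a minimal sketch of defining and launching a kernel. The kernel name `scaleByTwo`, the array size, and the launch configuration are made up for illustration and are not taken from the text above.

```cuda
#include <cuda_runtime.h>

// Kernel: runs on the device; each thread doubles one element.
__global__ void scaleByTwo(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's global index
    if (i < n)                                      // guard: the grid may have extra threads
        data[i] *= 2.0f;
}

int main() {
    const int n = 1024;
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));         // allocate device memory
    cudaMemset(d_data, 0, n * sizeof(float));

    // Launching the kernel: 8 blocks of 128 threads each (8 * 128 = 1024 threads).
    scaleByTwo<<<8, 128>>>(d_data, n);
    cudaDeviceSynchronize();                        // wait for the kernel to finish

    cudaFree(d_data);
    return 0;
}
```

The `<<<blocks, threads>>>` launch syntax is one of the CUDA extensions to C: it specifies how many threads the kernel is executed with.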
Two main pieces of hardware are involved: the CPU and the GPU. We use the following terminology:
- Host: the CPU and its memory
- Device: the GPU and its dedicated DRAM
Programs often contain both serial and parallel parts. We run the serial code on the CPU and the parallel code (kernels) on the GPU. Using the CPU and GPU together in this way is often termed heterogeneous parallel programming.
This is where CUDA comes in: CUDA is a heterogeneous parallel programming language designed specifically for NVIDIA GPUs.
CUDA is essentially C with a set of extensions. In the CUDA programming model, the host is in control of the program. Host and device communicate over the PCIe bus, whose bandwidth is low relative to the memory bandwidth of the host and the device, which makes exchanging data between the CPU and GPU expensive. Therefore, only the portions of the code that are massively parallel are executed on the device.
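Below is a rough sketch of this host-in-control pattern: the host allocates device memory, copies input across the PCIe bus, launches the kernel, and copies the result back. The kernel name `addOne` and the sizes are illustrative, not prescribed by the text.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 256;
    float h_data[n];                                   // host (CPU) array
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));            // device (GPU) memory

    // Host -> device copy over PCIe: the expensive step the text warns about.
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    addOne<<<1, n>>>(d_data, n);                       // the parallel work stays on the device

    // Device -> host copy, again over PCIe.
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);

    printf("h_data[0] = %.1f, h_data[%d] = %.1f\n", h_data[0], n - 1, h_data[n - 1]);
    return 0;
}
```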
Kernels are executed as a set of parallel threads. CUDA is designed to execute many thousands of threads at once (for example, some GPUs support up to 114,688 resident threads across the whole device).
CUDA threads execute in a SIMD (Single Instruction, Multiple Data) fashion; on NVIDIA GPUs this model is called SIMT (Single Instruction, Multiple Thread).
Note that the threads do not necessarily finish their work at the same rate, since they act on different data, so their completion times can differ slightly.
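As a small illustration of the warp grouping behind SIMT, the sketch below has each warp's first thread report where its warp starts; the kernel name `whoAmI` and the launch configuration are invented for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread belongs to a warp (a group of warpSize = 32 threads).
__global__ void whoAmI() {
    int warp = threadIdx.x / warpSize;   // which warp within the block
    int lane = threadIdx.x % warpSize;   // position within the warp
    if (lane == 0)                       // one printout per warp
        printf("block %d, warp %d starts at thread %d\n", blockIdx.x, warp, threadIdx.x);
}

int main() {
    whoAmI<<<2, 128>>>();                // 2 blocks of 128 threads = 4 warps per block
    cudaDeviceSynchronize();
    return 0;
}
```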
In order to organise the threads running on the cores, CUDA uses a thread hierarchy with three levels:
- Threads: At the lowest level of the hierarchy we have threads. A kernel executes as a set of threads.
- Blocks: Threads are grouped into blocks.
- Grids: Blocks are grouped into grids, which sit at the highest level of the hierarchy. The entire kernel launch is mapped to one grid, which is mapped to the GPU and its memory. In summary, one kernel executes as one grid, which is mapped to part or all of the device.
In short, Threads \(\in\) Blocks \(\in\) Grid.
Grids can be 1D, 2D, or 3D.
Imagine a 1D grid with 4 blocks (4 x 1). Each block can itself be a 1D, 2D, or 3D arrangement of threads.
If the blocks are 2D with dimensions 4 x 5 (4 threads in x and 5 threads in y), the threads of each block form a 4 x 5 arrangement. In total we have 20 threads per block, and with 4 such blocks we have 80 threads.
A detailed discussion can be found in Indexing Threads within Grids and Blocks.
Upon launching a kernel with this grid configuration, a total of 80 threads are executed in parallel.
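Here is a small sketch of this exact configuration using CUDA's `dim3` type for the grid and block dimensions; the kernel name `printIndex` and the index arithmetic are illustrative (detailed indexing is covered in the section referenced above).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each of the 80 threads computes a unique index within the 4-block x (4 x 5)-thread launch.
__global__ void printIndex() {
    int threadsPerBlock = blockDim.x * blockDim.y;            // 4 * 5 = 20
    int idInBlock = threadIdx.y * blockDim.x + threadIdx.x;   // 0..19 within a block
    int globalId  = blockIdx.x * threadsPerBlock + idInBlock; // 0..79 over the grid
    printf("block %d, thread (%d,%d) -> global id %d\n",
           blockIdx.x, threadIdx.x, threadIdx.y, globalId);
}

int main() {
    dim3 grid(4);                    // 1D grid with 4 blocks
    dim3 block(4, 5);                // 2D blocks: 4 threads in x, 5 threads in y (20 per block)
    printIndex<<<grid, block>>>();   // 4 * 20 = 80 threads in total
    cudaDeviceSynchronize();
    return 0;
}
```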